Start with a short introduction about what is going on in this document
Introduction to PDCA, CAST and the methods contained within this document go here.
Explain some background info that is needed. Change header colours to purple?
Step through explanation of 2 stream sample data included. Plot both oil and gas production. Pick one well to apply across. Should this be before the arps parameters? Also show the uwi.matrix table where we summarize important parameters from each well
Key parameters for each production dataset have been summarized and compiled into the uwi.matrix table: cumulative months on production, initial and peak oil and gas rates, peak month, and cumulative volumes for each UWI (a subset of the oil-related columns is shown below). These parameters will be used in subsequent calculations and help us prioritize wells as we begin to learn from wells with longer production histories later in the PDCA process.
| UWI | OnProd Date | Initial Oil Rate (bbl/d) | Peak Oil Rate (bbl/d) | Peak Month | Cum Oil (bbl) |
|---|---|---|---|---|---|
| 100/05-17-081-16W6/00 | 2018-04-01 | 26.123677 | 328.6216 | 3 | 62958.48 |
| 100/11-17-081-16W6/00 | 2018-05-01 | 219.027416 | 544.5717 | 2 | 95700.09 |
| 100/12-17-081-16W6/00 | 2018-05-01 | 265.308244 | 538.2610 | 2 | 82928.63 |
| 102/05-17-081-16W6/00 | 2018-04-01 | 6.646233 | 406.3637 | 3 | 39977.40 |
| 102/12-17-081-16W6/00 | 2018-05-01 | 135.210625 | 439.8464 | 2 | 60875.30 |
| 103/05-17-081-16W6/00 | 2018-04-01 | 7.631636 | 378.4579 | 3 | 58147.41 |
Arps decline analysis is a common methodology for forecasting well production.
Explanation of that equation and a definition using code..?
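In symbols, the modified hyperbolic model implemented in the code below can be written as:

$$q(t) = \frac{q_i}{\left(1 + b\,a_i\,t\right)^{1/b}}, \quad t \le t_{lim} \qquad\qquad q(t) = q_{lim}\,e^{-a_{lim}\left(t - t_{lim}\right)}, \quad t > t_{lim}$$

where the transition rate and time are

$$q_{lim} = q_i\left(\frac{a_{lim}}{a_i}\right)^{1/b}, \qquad t_{lim} = \frac{\left(q_i/q_{lim}\right)^b - 1}{b\,a_i}$$

and $a_i$, $a_{lim}$ are nominal decline rates converted from the effective (secant) declines $D_i$ and $D_{lim}$.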
We’ll define the modified hyperbolic equation in executable form. The equation will accept decline parameters and a time series across which it can evaluate production rate. First we’ll need a function to convert secant decline percentage into nominal form.
## to.nominal function converts effective decline rates to nominal decline rates
to.nominal <- function(De, b = 0){
  if (b == 0){
    a <- -log(1 - De)
  } else if (b == 1){
    a <- De/(1 - De)
  } else {
    a <- (1/b) * (((1 - De)^(-b)) - 1)
  }
  return(a)
}
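A quick sanity check of the conversion (the function is restated here so the chunk runs standalone):

```r
# Restates to.nominal from above so this chunk is self-contained
to.nominal <- function(De, b = 0){
  if (b == 0){
    a <- -log(1 - De)
  } else if (b == 1){
    a <- De/(1 - De)
  } else {
    a <- (1/b) * (((1 - De)^(-b)) - 1)
  }
  return(a)
}
# Exponential (b = 0): 50% effective -> -ln(0.5) ~ 0.693 nominal
round(to.nominal(0.5), 3)
# Harmonic (b = 1): 50% effective -> 0.5/0.5 = 1 nominal
to.nominal(0.5, b = 1)
# The general formula reduces to the explicit b = 1 special case
all.equal(to.nominal(0.5, b = 1), (((1 - 0.5)^(-1)) - 1)/1)
```

Note that the nominal decline always exceeds the effective decline for the same De, and grows with b.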
## MH function is the modified hyperbolic function, where a single b value transitions
## to an exponential decline at a limiting decline rate
# The function accepts equation constants and an elapsed-time vector in days
# Output is the rate at each given time for the given parameters
MH <- function(qi, b = 1.5, Di = 0.7, Dlim = 0.07, q.abd = 3, t){
  #qi must be entered as a rate
  #Di must be a decimal in effective secant format, as % per year
  #t must be entered in days
  #Guard against parameters that would bypass the exponential tail
  if (b != 0 && Di <= Dlim) stop("Di must be greater than Dlim")
  #Convert Di and Dlim to nominal
  ai <- to.nominal(Di)
  alim <- to.nominal(Dlim)
  #Convert t to yearly time fraction
  t <- t/365.25
  if (b == 0){
    #Exponential decline
    q <- qi*exp(-ai*t)
  } else {
    #Calculate the limiting rate at which decline transitions to exponential
    qlim <- qi*(alim/ai)^(1/b)
    #Calculate the limiting decline time, in years
    tlim <- (((qi/qlim)^b) - 1)/(b*ai)
    #Calculate q assuming hyperbolic
    q <- qi/((1 + b*ai*t)^(1/b))
    #Replace q with the exponential tail when t > tlim
    q[t > tlim] <- qlim*exp(-alim*(t[t > tlim] - tlim))
  }
  #Abandonment cutoff. When q < q.abd, q = 0
  q[q < q.abd] <- 0
  return(q)
}
Explain the background of what it is and how it can be helpful in forecasting, especially to handle uncertainty. Define PDCA here! Briefly explain Bayesian methods of achieving this? Priors and posteriors.
Explain the place of the Monte Carlo method as our first try at estimating the posterior distribution, showing parameter distributions as defining our prior and explaining how each is sampled. Have inputs here for Monte Carlo, build any functions required, and run. Show clean overlaid histograms of the sampling for b, di, and qi based on the input criteria.
# Define distribution constraints for our Monte Carlo sampling of arps parameters
# Only oil values will be defined here.
## bi values
o.bi_min <- 0.5 #Initial oil b value min value
o.bi_max <- 2 #Initial oil b value max value
## di values; as decimals in secant. Will be converted to nominal in arps equation.
o.di_min <- 0.4 #Initial oil decline percentage min
o.di_max <- 0.98 #Initial oil decline percentage max
## qi multiplier or absolute values; multipliers will be applied to peak initial rates.
o.qi_min <- 10 #Absolute min oil rate (bbls/d) for qi
o.qi_max <- 3 #Peak oil rate multiplier to define max oil qi (3 * peak_rate)
## dlim values; as decimals in secant.
o.dlim <- 0.07 #Limiting decline percentage used in modified hyperbolic equation
# Abandonment values; as rates in production units
o.abd <- 3
g.abd <- 10
# To handle multiple distribution types, let's define a "DefineDistribution" class
# The class will contain distribution information in a list, defined in S4
# Type = Normal, Lognormal, or Uniform; min_mean = min bound for uniform or mean of lognormal
# or normal; max_sd = max bound for uniform or standard deviation for lognormal or normal
setClass("DefineDistribution",slots = list(type="character",min_mean="numeric",max_sd="numeric"))
# Using this class, we will define uniform distributions for all of our inputs
b <- new("DefineDistribution",type = "Uniform",min_mean = o.bi_min,max_sd = o.bi_max)
di <- new("DefineDistribution",type = "Uniform",min_mean = o.di_min,max_sd = o.di_max)
dlim <- new("DefineDistribution",type = "Uniform",min_mean = o.dlim,max_sd = o.dlim)
#Let's pick the first well, 100/05-17-081-16W6/00, and use its peak rate to show the distribution of qi
o.qpeak <- matrix[1,"Oil.Peak.Rate"]
#Then define its distribution
qi <- new("DefineDistribution",type = "Uniform",min_mean = o.qi_min, max_sd = o.qpeak*o.qi_max)
Let’s visualize our defined distributions and overlay our samples from them. We’ll use a function named sample.any, defined in the background, that accepts a DefineDistribution class object and randomly samples based on its type and inputs.
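sample.any itself lives in a background resource file and isn't shown in this document; a minimal sketch consistent with the DefineDistribution class above might look like this (the real implementation may differ):

```r
library(methods)
# Class restated here so this chunk runs standalone
setClass("DefineDistribution",
         slots = list(type = "character", min_mean = "numeric", max_sd = "numeric"))
# Hypothetical sketch of sample.any: draw n values according to the object's type,
# interpreting min_mean/max_sd as documented in the class definition above
sample.any <- function(n, distribution){
  switch(distribution@type,
         Uniform   = runif(n, min = distribution@min_mean, max = distribution@max_sd),
         Normal    = rnorm(n, mean = distribution@min_mean, sd = distribution@max_sd),
         Lognormal = rlnorm(n, meanlog = distribution@min_mean, sdlog = distribution@max_sd),
         stop("Unknown distribution type: ", distribution@type))
}
# Example: uniform b samples stay within the defined bounds
b.demo <- new("DefineDistribution", type = "Uniform", min_mean = 0.5, max_sd = 2)
range(sample.any(1000, b.demo)) # within [0.5, 2]
```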
# Sample distributions using custom function sample.any
b.sample <- sample.any(n=10000,distribution = b)
di.sample <- sample.any(n=10000,distribution = di)
qi.sample <- sample.any(n=10000, distribution = qi)
# Note here Dlim is constant at 7% so we won't show its distribution
# Display distributions using ggplot2 histograms
ggplot(data = data.frame(b = b.sample), aes(x = b)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "grey", alpha = .5) +
  geom_density(alpha = .2, fill = "#763F98") +
  theme_classic() +
  theme(text = element_text(size = rel(5)))
ggplot(data = data.frame(di = di.sample), aes(x = di)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "grey", alpha = .5) +
  geom_density(alpha = .2, fill = "#763F98") +
  theme_classic() +
  theme(text = element_text(size = rel(5)))
ggplot(data = data.frame(qi = qi.sample), aes(x = qi)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "grey", alpha = .5) +
  geom_density(alpha = .2, fill = "#763F98") +
  theme_classic() +
  theme(text = element_text(size = rel(5)))
Uniform Distributions of b, di, and qi
Explain here and have links to jump around document? Or explain when we get to it?
Explain the process of forecast synthesis here, and show how it is carried out for the modified hyperbolic. Show the Arps substitution, and explain.
Let’s combine and organize our sampled parameters in a data frame. We’ll use an equation-specific function to ensure completeness and repeatability for multi-well matching.
# The MHSample function will sample and combine all parameters based on their
# DefineDistribution classes
MHSample <- function(n = 10000, qi, b, di, dlim){
  #Sample the parameters relevant to the MH function: qi, b, di, dlim
  #Inputs are DefineDistribution objects created prior to the function call,
  # containing distribution information for each parameter
  #Output is a Monte Carlo data frame with n x m dimensions, where n is the sample
  # count and m is the number of parameters needed for the MH equation
  #Step through each parameter and call the sample.any function
  qi.out <- sample.any(n, qi)
  b.out <- sample.any(n, b)
  di.out <- sample.any(n, di)
  dlim.out <- sample.any(n, dlim)
  #Combine the returned columns into a data frame for output
  sampledParams <- data.frame("qi" = qi.out, "b" = b.out, "di" = di.out, "dlim" = dlim.out)
  #Return the sampled data frame
  return(sampledParams)
}
# Pass the previously defined distributions into the MHSample function
# Note this repeats our earlier sampling in a cleaner form
MH.sampled.parameters <- MHSample(n = 10000,qi = qi,b = b,di = di,dlim = dlim)
# Return in table, displaying first 10 rows only
kable(MH.sampled.parameters[c(1:10),], align = 'c') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = F, position = "center")
| qi | b | di | dlim |
|---|---|---|---|
| 149.55661 | 1.3771858 | 0.9056401 | 0.07 |
| 702.25675 | 1.0070127 | 0.8109728 | 0.07 |
| 81.13509 | 0.7138906 | 0.9575267 | 0.07 |
| 417.07264 | 0.5976827 | 0.7614572 | 0.07 |
| 869.65404 | 1.2382736 | 0.5138593 | 0.07 |
| 413.75675 | 1.7013493 | 0.6649539 | 0.07 |
| 431.16627 | 0.5109597 | 0.5670319 | 0.07 |
| 619.74954 | 1.2612917 | 0.9717101 | 0.07 |
| 406.22448 | 0.5904154 | 0.4300673 | 0.07 |
| 877.48734 | 0.6632781 | 0.4752926 | 0.07 |
Using the combined parameter dataframe and the modified hyperbolic equation, we can define a function that treats each row of the parameter dataframe as a Monte Carlo iteration and substitutes it into the Arps equation across the time interval of the current well. The function will create and return an mcIteration object for each iteration. This object is a list with slots for Iteration name, Parameters, and Synth. Some additional slots will be created and populated later as we approach cost and forecasting.
# A function, MH.synthesize, will be defined to carry out rate synthesis across the
# actuals time interval.
MH.synthesize <- function(parameters, q.abd, t){
  #Parse the parameters and pass them into the MH function call
  fc <- MH(parameters[["qi"]], parameters[["b"]], parameters[["di"]], parameters[["dlim"]], q.abd, t)
  #Simplify the parameters into a single string representation
  parametersName <- paste(names(parameters), round(parameters, digits = 2), sep = "=", collapse = " ")
  #Create an mcIteration imitation list to improve performance and ease of combining data
  #Return the mcIteration. Remaining list items will be appended by later functions
  return(list("Iteration" = parametersName, "Parameters" = parameters, "Synth" = fc))
}
# Define a time interval of actuals to synthesize across, using 100/05-17-081-16W6/00
uwi<-"100/05-17-081-16W6/00"
# Subset production including only 100/05-17-081-16W6/00
singleWell.all <- formated_production$CD[which(formated_production$CD$Unique.Well.ID ==
uwi),]
# Find peak oil month for this uwi
uwi.pm <- matrix[["Oil.Peak.Month"]][which(matrix$eachUWI==uwi)]
# Refine production datatable to only include data after peak oil rate
singleWell<-singleWell.all[-c(1:(uwi.pm-1)),]
# Create time vector in days, starting at 0 at the peak oil month
synth.time <- (singleWell$CD.Month - uwi.pm) * 30.4 # ~30.4 days per month converts months to days
# Let's apply the MH.synthesize function in an apply call to execute across each row
# of the parameters dataframe
oil.synth<-apply(MH.sampled.parameters,1,MH.synthesize,q.abd = o.abd, t = synth.time)
# Print first object of oil.synth to show mcIteration object
print(oil.synth[[1]])
## $Iteration
## [1] "qi=149.56 b=1.38 di=0.91 dlim=0.07"
##
## $Parameters
## qi b di dlim
## 149.5566092 1.3771858 0.9056401 0.0700000
##
## $Synth
## [1] 149.55661 125.68583 109.24546 97.13914 87.80085 80.34768 74.24143
## [8] 69.13402 64.78971 61.04274 57.77297 54.89099 52.32883 50.03381
## [15] 47.96446 46.08761 44.37643 42.80894 41.36697
Each mcIteration list, defining a Monte Carlo iteration and historical synthesis, can now be evaluated against the actual data to search for an appropriate model.
How do we compare the synthesis forecasts against our actuals and determine what is a good fit and what is a poor fit?
To compare each synthesis to the historical production data observed, we need to use functions that compare predicted points to actuals for each model. These functions are commonly referred to as loss functions. The most common loss function is the L2 loss function (insert link here?), which is popular due to its easy implementation, especially in linear models. However, the L2 loss is not as useful when comparing data with significant outliers, such as oil and gas production data, because it assigns too much weight to these outliers and rejects models that do not match them. We will need to look at more robust loss functions to find models that match the historical data as an engineer would.
Soft L1 loss is a robust, continuous variation of the common Huber loss that gives less weight to outliers than an L2 loss, making it useful for fitting time-series production data. Our Soft L1 definition will return the individual loss of each point and an average cost across all input residuals.
soft_L1 <- function(r){
  IndvCost <- 2*((1 + abs(r))^0.5 - 1)
  Cost <- sum(IndvCost, na.rm = TRUE) / length(IndvCost)
  return(list(IndvCost, Cost))
}
To apply the loss equation we’ll define a function to apply the Soft L1 loss to the residuals of each model match. The function will include an option to apply the loss function against the log residuals, an option that we’ll take advantage of due to the non-linear nature of production data. The function will populate the IndvCost and Cost slots of each mcIteration object, and once the function is defined, we can use these slots to visualize each model’s fit.
## applyLoss accepts an mcIteration object, actual data, a loss function to apply,
## and an option to use log or absolute residuals; it populates the IndvCost and
## Cost slots of the mcIteration object and returns it.
applyLoss <- function(iterationObject, actuals, loss.function = soft_L1,
                      log.residuals = TRUE){
  # Calculate residuals of the model synthesis against actuals. If log.residuals
  # is TRUE, take residuals of the logged values
  if (log.residuals == TRUE){
    r <- log(actuals) - log(iterationObject$Synth)
  } else {
    r <- actuals - iterationObject$Synth
  }
  # Apply the loss function to the residuals. Returns a list of IndvCost and Cost.
  rho <- loss.function(r)
  # Append IndvCost and avg_cost (named Cost) to the mcIteration imitation list,
  # calling from the rho list slots
  iterationObject$IndvCost <- rho[[1]]
  iterationObject$Cost <- rho[[2]]
  # Return the updated iterationObject
  return(iterationObject)
}
# Use the applyLoss function in an lapply call across each object in oil.synth
oil.synth <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = soft_L1, log.residuals = FALSE)
# Repeat for log residuals
oil.synth.log <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = soft_L1, log.residuals = TRUE)
# Updated mcIteration object
print(oil.synth[[1]])
## $Iteration
## [1] "qi=149.56 b=1.38 di=0.91 dlim=0.07"
##
## $Parameters
## qi b di dlim
## 149.5566092 1.3771858 0.9056401 0.0700000
##
## $Synth
## [1] 149.55661 125.68583 109.24546 97.13914 87.80085 80.34768 74.24143
## [8] 69.13402 64.78971 61.04274 57.77297 54.89099 52.32883 50.03381
## [15] 47.96446 46.08761 44.37643 42.80894 41.36697
##
## $IndvCost
## [1] 24.8376599 19.5115191 14.6114603 10.3052483 8.8590019 7.5320440
## [7] 10.1208577 6.8127360 5.2085792 3.8434490 2.0350757 2.7126292
## [13] 0.4156658 2.5246885 2.2184544 2.2649887 3.7291126 3.6118306
## [19] 3.6339453
##
## $Cost
## [1] 7.094155
With cost calculated using the Soft L1 loss function applied to the standard and log residuals for every model iteration, let’s use the average cost to find our “best fit” model out of the 10,000 iterations. We’ll plot the model against the actual oil rate, coloured by individual cost for both residual methods, to see how each handles outliers.
# Use lapply and do.call to combine the average cost of every model object
# into a single dataframe
oil.cost <- do.call(rbind,lapply(oil.synth,function(x){
return(x$Cost)
}))
# Repeat for log residuals
oil.cost.log <- do.call(rbind,lapply(oil.synth.log,function(x){
return(x$Cost)
}))
# Find index of best fit model using log soft_L1 residuals
oil.bf.log <- which.min(oil.cost.log)
# Plot this "best fit" against actuals and colour by standard and log residual cost
# To plot, we'll utilize some custom plotly functions defined in plotResources external file
oil.synth.plot <- plotly.multimodel.indvcostcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth[oil.bf.log],
scales = "semi-log")
oil.synth.plot <- layout(oil.synth.plot, xaxis = list(title = "Cal-Day Month"),
yaxis = list(title = "Oil Rate (bbls/d)"),
title = "Soft L1 Cost of Residuals")
oil.synth.plot.log <- plotly.multimodel.indvcostcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.log[oil.bf.log],
scales = "semi-log")
oil.synth.plot.log <- layout(oil.synth.plot.log, xaxis = list(title = "Cal-Day Month"),
yaxis = list(title = "Oil Rate (bbls/d)"),
title = "Soft L1 Cost of Log Residuals")
# Combine into a subplot
combined.cost <- subplot(oil.synth.plot,oil.synth.plot.log,nrows = 1,titleX = TRUE,titleY=FALSE)
combined.cost <- layout(combined.cost,title = "Soft L1 Cost of Absolute vs. Log Residuals",
yaxis=list(title="Oil Rate (bbls/d)"))
#IMPROVE TOOLTIP???
hide_colorbar(combined.cost)
Comparing the absolute residual cost in the left plot against the log residual cost in the right plot, there is a similar relative spread in the residuals and both cost the peak rate similarly, but the log residuals do a better job of costing the sub-peak at month 9. We will carry the log residual cost through the remainder of the CAST methodology as it captures the non-linearity of the production data better. One important thing to notice before moving on is that the Soft L1 function assigns a high cost to the peak oil rate and, as a result, does not match it perfectly. This may be concerning when compared to traditional DCA matches, but given that early-time production data can be far from steady-state flowing behaviour, it is a useful feature to maintain.
Using the log residuals and the Soft L1 cost function, let’s find the top 100 models from our initial 10,000 iterations and see how they describe the ranges of our actual oil rates. We’ll colour them by average cost to compare the forecasts.
# Find and plot top 100 (top 1%) models using the ggplot resource. Colour by cost.
oil.synth.P1 <- which(oil.cost.log <= quantile(oil.cost.log,0.01))
#Sort from highest cost to lowest for plotting visuals
oil.synth.P1 <- oil.synth.P1[order(oil.cost.log[oil.synth.P1],decreasing = TRUE)]
oil.plot.P1 <- ggplot.multimodel.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.log[oil.synth.P1],
scales = "semi-log")
# Add titles and output
oil.plot.P1 <- oil.plot.P1 +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)")
oil.plot.P1
Top 1% of Models Using the Soft L1 Loss
We can see that the top 100 forecasts do a good job of covering the range of actual oil rates, with the lowest cost forecasts tightly grouped around the actuals. We’ve shown we can use Monte Carlo sampling paired with a loss function to match historical production! Before moving on, note that the majority of the forecasts under-predict the initial rate. This is primarily driven by the high cost of the peak oil rate under the Soft L1, but could also be improved through our input distributions. This is something we can improve upon later.
Explain this loss function, where it’s from, and show a comparison of application to the Soft L1 loss
The Soft L1 loss function does a decent job of finding a best-fit model, but is there a better loss function we could use? Let’s look at a more complete loss function defined in SPE-174784-PA (ADD LINK HERE AS REFERENCE) by Fulford et al., which takes a similar approach, but .. EXPLAIN HERE
# As with our Soft L1, pdca_loss will return slots for IndvCost and Cost
pdca_loss <- function(r, Eps_mean_min = 0, Eps_std_min = 0){
  IndvCost <- r^2 # pdca loss simply uses L2 loss for individual cost
  Eps_mean <- mean(r, na.rm = TRUE) #Mean of residuals
  Eps_std <- sd(r, na.rm = TRUE)    #Standard deviation of residuals
  Cost <- ((Eps_mean - Eps_mean_min)^2/.1)^2 + ((Eps_std - Eps_std_min)^2/.01)^2
  return(list(IndvCost, Cost))
}
Just as we did with the Soft L1, let’s use the PDCA loss function to find the top 100 forecasts from the same 10,000 iterations. We’ll set the minimum mean and minimum standard deviation of epsilon to 0, assuming a perfect model. Again we’ll colour by average cost and compare to the Soft L1 method.
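Before applying it across the full iteration set, a standalone toy example (restating pdca_loss from above) shows how this cost trades off residual bias against residual scatter:

```r
# Restates pdca_loss so this chunk is self-contained
pdca_loss <- function(r, Eps_mean_min = 0, Eps_std_min = 0){
  IndvCost <- r^2
  Eps_mean <- mean(r, na.rm = TRUE)
  Eps_std <- sd(r, na.rm = TRUE)
  Cost <- ((Eps_mean - Eps_mean_min)^2/.1)^2 + ((Eps_std - Eps_std_min)^2/.01)^2
  return(list(IndvCost, Cost))
}
# A model biased high by a constant 0.3 with zero scatter ...
pdca_loss(c(0.3, 0.3, 0.3, 0.3))[[2]]   # (0.3^2/0.1)^2 = 0.81
# ... versus an unbiased model with +/-0.3 scatter
pdca_loss(c(0.3, -0.3, 0.3, -0.3))[[2]] # (0.12/0.01)^2 = 144
```

With these weights, scatter in the residuals is penalized far more heavily than a constant bias, which favours models that track the shape of the decline.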
# Apply PDCA loss function
oil.synth.PDCA <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = pdca_loss, log.residuals = TRUE)
# Use lapply and do.call to combine each object loss into a single dataframe
oil.cost.PDCA <- do.call(rbind,lapply(oil.synth.PDCA,function(x){
return(x$Cost)
}))
# Find and plot top 100 (top 1%) models using the ggplot resource. Colour by PDCA cost.
oil.synth.pdcaP1 <- which(oil.cost.PDCA <= quantile(oil.cost.PDCA,0.01,na.rm = TRUE))
#Sort from highest cost to lowest for plotting visuals
oil.synth.pdcaP1 <- oil.synth.pdcaP1[order(oil.cost.PDCA[oil.synth.pdcaP1],decreasing = TRUE)]
oil.plot.pdcaP1 <- ggplot.multimodel.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.PDCA[oil.synth.pdcaP1],
scales = "semi-log")
# Add titles and output
oil.plot.pdcaP1 <- oil.plot.pdcaP1 + ggtitle("PDCA Loss Function") +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)")
oil.plot.P1 + ggtitle("Soft L1 Loss Function")
oil.plot.pdcaP1
Comparing the Top 1% of Models Using the Soft L1 and PDCA Losses
Summarize findings in these two charts and how we’re going to carry PDCA loss going forward in the rest of the evaluation / process as it provides a better probabilistic representation
This has already been done. Still need?
Show the best fit using either loss function and explain that, while this may be a good fit for wells with long production histories, we still want a probabilistic outcome for each well to capture forecast uncertainty.
Show the top 10% of models, etc. using either loss function, and show that the Monte Carlo method returns adequate forecasts for wells with good production history, but not for short-history wells. What’s next? MCMC
Explain concept or link above if put there.
Show MC method as burn-in to determine best fit forecast to apply PDCA loss function to
Creation of P10, P50, Avg, and P90 forecasts
Darcy Redpath
Petronas Canada
dredpath@petronascanada.com